NLP With Transformers
We will use the HuggingFace implementation of transformers. Since we have already installed PyTorch, we can now install the transformers package with
pip install transformers
It includes many pre-trained neural networks for performing a number of tasks involving natural language processing.
As a first example, let's detect the sentiment of a short text, using a pre-trained network.
import transformers as tr
sentiment = tr.pipeline('sentiment-analysis')
sentiment('CS440 is a great class!')
[{'label': 'POSITIVE', 'score': 0.9998645186424255}]
sentiment('Completing my BS degree in computer science took a lot of hard work.')
[{'label': 'NEGATIVE', 'score': 0.990814745426178}]
sentiment('But I am happy to have graduated.')
[{'label': 'POSITIVE', 'score': 0.9998417496681213}]
Let's try sentences from the preface of Stuart Russell's book Human Compatible: Artificial Intelligence and the Problem of Control.
russell = '''
This book is about the past, present, and future of our attempt to understand and create intelligence.
This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
We cannot predict exactly how the technology will develop or on what timeline.
Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
What then?
Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
'''
russell = [s for s in russell.split('\n') if len(s) > 0]
russell
['This book is about the past, present, and future of our attempt to understand and create intelligence.', 'This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.', "The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.", 'We cannot predict exactly how the technology will develop or on what timeline.', 'Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.', 'What then?', 'Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.', 'The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.']
sentiment(russell)
[{'label': 'POSITIVE', 'score': 0.9988692998886108}, {'label': 'POSITIVE', 'score': 0.9920300245285034}, {'label': 'POSITIVE', 'score': 0.9992543458938599}, {'label': 'NEGATIVE', 'score': 0.996938943862915}, {'label': 'NEGATIVE', 'score': 0.9982810020446777}, {'label': 'NEGATIVE', 'score': 0.9920858144760132}, {'label': 'POSITIVE', 'score': 0.9977427124977112}, {'label': 'NEGATIVE', 'score': 0.9920915365219116}]
for words, sentiment_result in zip(russell, sentiment(russell)):
    print('\n', words)
    print('  ', sentiment_result['label'], sentiment_result['score'])
 This book is about the past, present, and future of our attempt to understand and create intelligence.
   POSITIVE 0.9988692998886108

 This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
   POSITIVE 0.9920300245285034

 The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
   POSITIVE 0.9992543458938599

 We cannot predict exactly how the technology will develop or on what timeline.
   NEGATIVE 0.996938943862915

 Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
   NEGATIVE 0.9982810020446777

 What then?
   NEGATIVE 0.9920858144760132

 Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
   POSITIVE 0.9977427124977112

 The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
   NEGATIVE 0.9920915365219116
This model was trained on the SST-2 dataset, which contains 11,855 sentences from Rotten Tomatoes movie reviews.
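We can ask the pipeline which checkpoint it actually loaded. The attributes below are standard transformers API; the default checkpoint (at the time of writing, a DistilBERT model fine-tuned on SST-2) may change between library versions.
print(sentiment.model.name_or_path)     # name of the checkpoint the pipeline downloaded
print(sentiment.model.config.id2label)  # mapping from output index to label, e.g. NEGATIVE/POSITIVE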
The transformers package contains other pre-trained models, including the following one trained on multiple languages.
sentiment = tr.pipeline('sentiment-analysis', model='nlptown/bert-base-multilingual-uncased-sentiment')
sentiment(russell)
[{'label': '5 stars', 'score': 0.4648197293281555}, {'label': '3 stars', 'score': 0.42040956020355225}, {'label': '5 stars', 'score': 0.6286720633506775}, {'label': '3 stars', 'score': 0.3582031726837158}, {'label': '3 stars', 'score': 0.47712787985801697}, {'label': '1 star', 'score': 0.3162928521633148}, {'label': '5 stars', 'score': 0.5440354943275452}, {'label': '4 stars', 'score': 0.3630126714706421}]
sentiment.model
BertForSequenceClassification( (bert): BertModel( (embeddings): BertEmbeddings( (word_embeddings): Embedding(105879, 768, padding_idx=0) (position_embeddings): Embedding(512, 768) (token_type_embeddings): Embedding(2, 768) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) (encoder): BertEncoder( (layer): ModuleList( (0): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (1): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (2): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (3): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) 
(4): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (5): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (6): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (7): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (8): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, 
bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (9): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (10): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (11): BertLayer( (attention): BertAttention( (self): BertSelfAttention( (query): Linear(in_features=768, out_features=768, bias=True) (key): Linear(in_features=768, out_features=768, bias=True) (value): Linear(in_features=768, out_features=768, bias=True) (dropout): Dropout(p=0.1, inplace=False) ) (output): BertSelfOutput( (dense): Linear(in_features=768, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) (intermediate): BertIntermediate( (dense): Linear(in_features=768, out_features=3072, bias=True) ) (output): BertOutput( (dense): Linear(in_features=3072, out_features=768, bias=True) (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True) (dropout): Dropout(p=0.1, inplace=False) ) ) ) ) (pooler): BertPooler( (dense): Linear(in_features=768, out_features=768, bias=True) (activation): Tanh() ) ) (dropout): Dropout(p=0.1, inplace=False) (classifier): Linear(in_features=768, out_features=5, bias=True) )
So, what is a BertModel? You can read the paper that introduced BERT here. The paper coins the acronym 'BERT', for Bidirectional Encoder Representations from Transformers, and describes the details of the model and how it is trained.
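One key ingredient of BERT is its WordPiece tokenizer, which splits unfamiliar words into subword pieces before they are embedded. A quick, hedged peek (the .tokenizer attribute is part of the pipeline API; the exact pieces depend on the checkpoint):
sentiment.tokenizer.tokenize('CS440 is a great class!')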
Let's look at other NLP applications that are available with HuggingFace.
m = sentiment.model
m.num_parameters()
167360261
f'{m.num_parameters():,}'
'167,360,261'
summarize = tr.pipeline("summarization")
russell = '''
This book is about the past, present, and future of our attempt to understand and create intelligence.
This matters, not because AI is rapidly becoming a pervasive aspect of the present but because it is the dominant technology of the future.
The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time.
We cannot predict exactly how the technology will develop or on what timeline.
Nevertheless, we must plan for the possibility that machines will far exceed the human capacity for decision making in the real world.
What then?
Everything civilization has to offer is the product of our intelligence; gaining access to considerably greater intelligence would be the biggest event in human history.
The purpose of the book is to explain why it might be the last event in human history and how to make sure that it is not.
'''
summarize(russell, max_length=130, min_length=30, do_sample=False)
[{'summary_text': " This book is about the past, present, and future of our attempt to understand and create intelligence . The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time ."}]
summarize(russell, max_length=160, min_length=50, do_sample=False)
[{'summary_text': " This book is about the past, present, and future of our attempt to understand and create intelligence . The world's great powers are waking up to this fact, and the world's largest corporations have known it for some time . The purpose of the book is to explain why it might be the last event in human history and how to make sure it is not ."}]
translate_en_to_de = tr.pipeline('translation_en_to_de')
Some weights of T5Model were not initialized from the model checkpoint at t5-base and are newly initialized: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
translate_en_to_de('Hello, my name is Chuck.')
[{'translation_text': 'Hallo, mein Name ist Chuck.'}]
translate_en_to_de(russell)
[{'translation_text': 'Dieses Buch befasst sich mit der Vergangenheit, der Gegenwart und der Zukunft unseres Versuchs, Intelligenz zu verstehen und zu schaffen. Dies ist wichtig, nicht weil KI schnell zu einem allgegenwärtigen Aspekt der Gegenwart wird, sondern weil es die dominierende Technologie der Zukunft ist. Die Großmächte der Welt erwachen auf diese Tatsache, und die größten Konzerne der Welt wissen dies seit einiger Zeit. Wir können nicht genau vorhersagen,'}]
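Notice that the German translation stops mid-sentence: generation pipelines cap the output at a default maximum length. A hedged fix is to pass a larger max_length, a standard generation argument; the value needed depends on the length of the text.
translate_en_to_de(russell, max_length=400)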
Some of these notes are modified from Buomsoo Kim's blog, which I have found very helpful in understanding NLP history and current algorithms. This article by Samuel Lynn-Evans is also very helpful.
Take a look at manythings.org, which provides tab-separated bilingual sentence pairs from the Tatoeba project. We will use its German-English file.
import io
import zipfile
import re
from tqdm import tqdm # progress bar
import numpy as np
import torch
import matplotlib.pyplot as plt
!curl -O https://www.manythings.org/anki/deu-eng.zip
import io
import zipfile
with zipfile.ZipFile('deu-eng.zip') as zf:
    with io.TextIOWrapper(zf.open('deu.txt'), encoding="utf-8") as f:
        sentences = f.readlines()
sentences[:10]
['Go.\tGeh.\tCC-BY 2.0 (France) Attribution: tatoeba.org #2877272 (CM) & #8597805 (Roujin)\n', 'Hi.\tHallo!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #380701 (cburgmer)\n', 'Hi.\tGrüß Gott!\tCC-BY 2.0 (France) Attribution: tatoeba.org #538123 (CM) & #659813 (Esperantostern)\n', 'Run!\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #906328 (papabear) & #941078 (Fingerhut)\n', 'Run.\tLauf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #4008918 (JSakuragi) & #941078 (Fingerhut)\n', 'Wow!\tPotzdonner!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122382 (Pfirsichbaeumchen)\n', 'Wow!\tDonnerwetter!\tCC-BY 2.0 (France) Attribution: tatoeba.org #52027 (Zifre) & #2122391 (Pfirsichbaeumchen)\n', 'Fire!\tFeuer!\tCC-BY 2.0 (France) Attribution: tatoeba.org #1829639 (Spamster) & #1958697 (Tamy)\n', 'Help!\tHilfe!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #575889 (MUIRIEL)\n', 'Help!\tZu Hülf!\tCC-BY 2.0 (France) Attribution: tatoeba.org #435084 (lukaszpp) & #2122375 (Pfirsichbaeumchen)\n']
MAX_N_SENTENCES = 10000
MAX_SENTENCE_LENGTH = 10
eng_sentences, deu_sentences = [], []
eng_words, deu_words = set(), set()
for i in tqdm(range(MAX_N_SENTENCES)):
    sentence_i = np.random.randint(len(sentences))
    # find only letters in sentences
    eng_sent, deu_sent = ["<sos>"], ["<sos>"]  # start of sentence tag
    eng_sent += re.findall(r"\w+", sentences[sentence_i].split("\t")[0])
    deu_sent += re.findall(r"\w+", sentences[sentence_i].split("\t")[1])
    # change to lowercase
    eng_sent = [x.lower() for x in eng_sent]
    deu_sent = [x.lower() for x in deu_sent]
    eng_sent.append('<eos>')  # end of sentence tag
    deu_sent.append('<eos>')
    # Add <pad> to end of sentences that are shorter than MAX_SENTENCE_LENGTH
    if len(eng_sent) >= MAX_SENTENCE_LENGTH:
        eng_sent = eng_sent[:MAX_SENTENCE_LENGTH]
    else:
        eng_sent.extend(['<pad>'] * (MAX_SENTENCE_LENGTH - len(eng_sent)))
    if len(deu_sent) >= MAX_SENTENCE_LENGTH:
        deu_sent = deu_sent[:MAX_SENTENCE_LENGTH]
    else:
        deu_sent.extend(['<pad>'] * (MAX_SENTENCE_LENGTH - len(deu_sent)))
    # add parsed sentences
    eng_sentences.append(eng_sent)
    deu_sentences.append(deu_sent)
    # update unique words
    eng_words.update(eng_sent)
    deu_words.update(deu_sent)

eng_words, deu_words = list(eng_words), list(deu_words)
100%|██████████| 10000/10000 [00:00<00:00, 62761.45it/s]
len(eng_words), len(deu_words)
(4546, 6913)
for i in range(10):
    print()
    print(eng_sentences[i])
    print(deu_sentences[i])
['<sos>', 'tom', 'can', 't', 'believe', 'mary', 'let', 'herself', 'get', 'caught']
['<sos>', 'tom', 'kann', 'nicht', 'glauben', 'dass', 'maria', 'sich', 'erwischen', 'ließ']

['<sos>', 'even', 'though', 'tom', 'just', 'had', 'his', 'fortieth', 'birthday', 'i']
['<sos>', 'tom', 'hatte', 'zwar', 'gerade', 'seinen', 'vierzigsten', 'geburtstag', 'ich', 'glaube']

['<sos>', 'she', 'is', '35', 'years', 'old', 'and', 'in', 'the', 'prime']
['<sos>', 'sie', 'ist', '35', 'und', 'in', 'ihren', 'besten', 'jahren', '<eos>']

['<sos>', 'guess', 'who', 'i', 'am', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']
['<sos>', 'rate', 'wer', 'ich', 'bin', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']

['<sos>', 'even', 'tom', 'doesn', 't', 'know', 'mary', '<eos>', '<pad>', '<pad>']
['<sos>', 'selbst', 'tom', 'kennt', 'maria', 'nicht', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'tom', 'is', 'drinking', 'a', 'beer', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'tom', 'trinkt', 'ein', 'bier', '<eos>', '<pad>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'work', 'as', 'many', 'hours', 'as', 'you', 'do', '<eos>']
['<sos>', 'ich', 'arbeite', 'gleich', 'viele', 'stunden', 'wie', 'du', '<eos>', '<pad>']

['<sos>', 'the', 'students', 'couldn', 't', 'answer', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'antworten', 'konnten', 'die', 'studenten', 'nicht', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'want', 'a', 'new', 'kitchen', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'ich', 'will', 'eine', 'neue', 'küche', '<eos>', '<pad>', '<pad>', '<pad>']

['<sos>', 'i', 'll', 'dream', 'about', 'you', '<eos>', '<pad>', '<pad>', '<pad>']
['<sos>', 'ich', 'werde', 'von', 'dir', 'träumen', '<eos>', '<pad>', '<pad>', '<pad>']
Convert each word into an integer index.
for i in tqdm(range(len(eng_sentences))):
    eng_sentences[i] = [eng_words.index(x) for x in eng_sentences[i]]
    deu_sentences[i] = [deu_words.index(x) for x in deu_sentences[i]]
100%|██████████| 10000/10000 [00:09<00:00, 1024.66it/s]
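The cell above takes several seconds because list.index performs a linear scan through the vocabulary for every word. An equivalent but faster version, sketched here with dictionaries (eng_index and deu_index are names introduced just for this illustration; the conversion loop is commented out because the sentences above have already been converted to integers):
eng_index = {w: i for i, w in enumerate(eng_words)}   # word -> integer index, built once
deu_index = {w: i for i, w in enumerate(deu_words)}
# for i in range(len(eng_sentences)):
#     eng_sentences[i] = [eng_index[x] for x in eng_sentences[i]]   # same result, O(1) lookups
#     deu_sentences[i] = [deu_index[x] for x in deu_sentences[i]]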
i = 10
print(eng_sentences[i])
print([eng_words[x] for x in eng_sentences[i]])
print(deu_sentences[i])
print([deu_words[x] for x in deu_sentences[i]])
[3201, 4155, 4116, 1527, 2592, 1849, 1427, 1119, 3499, 1480] ['<sos>', 'tom', 'paid', 'a', 'lot', 'of', 'money', 'for', 'that', 'guitar'] [6383, 3308, 4227, 3056, 2729, 4397, 2640, 1213, 303, 2962] ['<sos>', 'tom', 'hat', 'einen', 'haufen', 'geld', 'für', 'diese', 'gitarre', 'bezahlt']
class TransformerNet(torch.nn.Module):

    def __init__(self, X_vocab_size, T_vocab_size, embedding_dim, n_hiddens, n_head, n_layers, dropout):
        super().__init__()
        # Learned embeddings for the source (English) and target (German) vocabularies.
        self.enc_embedding = torch.nn.Embedding(X_vocab_size, embedding_dim)
        self.dec_embedding = torch.nn.Embedding(T_vocab_size, embedding_dim)
        # Standard encoder-decoder transformer from PyTorch.
        self.transformer = torch.nn.Transformer(d_model=embedding_dim, nhead=n_head,
                                                num_encoder_layers=n_layers, num_decoder_layers=n_layers,
                                                dim_feedforward=n_hiddens, dropout=dropout)
        # Map the transformer's outputs back to log-probabilities over the target vocabulary.
        self.dense = torch.nn.Linear(embedding_dim, T_vocab_size)
        self.log_softmax = torch.nn.LogSoftmax(dim=2)

    def forward(self, X, T):
        src = self.enc_embedding(X)
        tgt = self.dec_embedding(T)
        Y = self.transformer(src, tgt)
        return self.log_softmax(self.dense(Y))
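A caveat worth knowing: torch.nn.Transformer as constructed here expects inputs shaped (sequence length, batch size, embedding dim) unless batch_first=True is passed (available in newer PyTorch releases), and it only hides future target words when a causal mask is supplied. The forward method above keeps things simple by passing (batch, sequence, embedding) tensors with no mask. A sketch of the more conventional call, under those assumptions (not used in the training below):
def forward_masked(self, X, T):
    # Sketch only: sequence-first layout plus a causal target mask so the decoder
    # cannot peek at the words it is asked to predict.
    src = self.enc_embedding(X).permute(1, 0, 2)    # (sequence, batch, embedding)
    tgt = self.dec_embedding(T).permute(1, 0, 2)
    tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(0))
    Y = self.transformer(src, tgt, tgt_mask=tgt_mask)
    return self.log_softmax(self.dense(Y.permute(1, 0, 2)))   # back to (batch, sequence, vocab)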
ENG_VOCAB_SIZE = len(eng_words)
DEU_VOCAB_SIZE = len(deu_words)
HIDDEN_SIZE = 16
EMBEDDING_DIM = 30
NUM_HEADS = 2
NUM_LAYERS = 3
DROPOUT = True  # note: dropout is meant to be a float probability (e.g. 0.1); True is interpreted as 1.0 here
DEVICE = 'cpu' # torch.device('cuda')
NUM_EPOCHS = 200
LEARNING_RATE = 1e-2
BATCH_SIZE = 128
n = 500
X = torch.tensor(eng_sentences[:n])
T = torch.tensor(deu_sentences[:n])
model = TransformerNet(ENG_VOCAB_SIZE, DEU_VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_SIZE, NUM_HEADS, NUM_LAYERS, DROPOUT).to(DEVICE)
nll_f = torch.nn.NLLLoss()
optimizer = torch.optim.Adam(model.parameters(), lr = LEARNING_RATE)
model
TransformerNet( (enc_embedding): Embedding(4546, 30) (dec_embedding): Embedding(6913, 30) (transformer): Transformer( (encoder): TransformerEncoder( (layers): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) ) (1): TransformerEncoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) ) (2): TransformerEncoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) ) ) (norm): LayerNorm((30,), eps=1e-05, elementwise_affine=True) ) (decoder): TransformerDecoder( (layers): ModuleList( (0): TransformerDecoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (multihead_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) (dropout3): Dropout(p=True, inplace=False) ) (1): TransformerDecoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (multihead_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) (dropout3): Dropout(p=True, inplace=False) ) (2): TransformerDecoderLayer( (self_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (multihead_attn): MultiheadAttention( (out_proj): Linear(in_features=30, out_features=30, bias=True) ) (linear1): Linear(in_features=30, out_features=16, bias=True) (dropout): Dropout(p=True, 
inplace=False) (linear2): Linear(in_features=16, out_features=30, bias=True) (norm1): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm2): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (norm3): LayerNorm((30,), eps=1e-05, elementwise_affine=True) (dropout1): Dropout(p=True, inplace=False) (dropout2): Dropout(p=True, inplace=False) (dropout3): Dropout(p=True, inplace=False) ) ) (norm): LayerNorm((30,), eps=1e-05, elementwise_affine=True) ) ) (dense): Linear(in_features=30, out_features=6913, bias=True) (log_softmax): LogSoftmax() )
list(model.children())[0]
Embedding(4546, 30)
len(eng_words)
4546
list(model.children())[1]
Embedding(6913, 30)
len(deu_words)
6913
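For comparison with the 167-million-parameter BERT model earlier, we can count this model's parameters by summing over its parameter tensors (num_parameters is a HuggingFace convenience method; a plain torch.nn.Module does not have it):
f'{sum(p.numel() for p in model.parameters()):,}'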
n_samples = X.shape[0]
likelihood_trace = []
for epoch in tqdm(range(NUM_EPOCHS)):
    # Must train in batches to avoid exceeding memory capacity and crashing python.
    loss = 0
    for first_sample in range(0, n_samples, BATCH_SIZE):
        last_sample = min(n_samples - 1, first_sample + BATCH_SIZE)
        use_rows = slice(first_sample, last_sample)
        Y = model(X[use_rows, :], T[use_rows, :])
        # NLLLoss expects the class (vocabulary) dimension second, so permute
        # (batch, sequence, vocab) to (batch, vocab, sequence).
        nll = nll_f(Y.permute(0, 2, 1), T[use_rows, :])
        optimizer.zero_grad()
        nll.backward()
        optimizer.step()
        loss += nll
    likelihood_trace.append((-loss).exp())
100%|██████████| 200/200 [01:15<00:00, 2.66it/s]
plt.plot(range(1, NUM_EPOCHS + 1), likelihood_trace)
plt.xlabel('Epoch')
plt.ylabel('Likelihood')
Text(0, 0.5, 'Likelihood')
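One caution about the trace: each appended value is still a tensor attached to the computation graph, and some PyTorch and Matplotlib versions will refuse to plot such tensors. Storing plain Python floats avoids both that error and the memory held by the graph; a one-line variant of the append in the training loop:
likelihood_trace.append((-loss).exp().item())   # .item() detaches and converts to a float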
def encoding_to_words(sentence, vocab):
    return ' '.join(filter(lambda x: x != '<sos>' and x != '<pad>' and x != '<eos>',
                           [vocab[x] for x in sentence]))
n = 10
with torch.no_grad():
    Y = model(X[:n, :], T[:n, :])
    Y_sentences = Y.argmax(-1)
Y_sentences[:5]
tensor([[6383, 3308, 2704, 4119, 3575, 5705, 349, 1431, 2751, 2738], [6383, 3308, 6167, 869, 3138, 4353, 795, 3541, 5910, 4721], [6383, 1809, 6405, 2358, 5847, 4792, 4020, 6196, 2702, 4842], [6383, 5213, 3945, 5910, 1813, 4842, 4525, 4525, 4525, 4525], [6383, 4867, 3308, 4013, 349, 4119, 4842, 4525, 4525, 4525]])
for i in range(n):
    print()
    print('    Input:', encoding_to_words(X[i], eng_words))
    print('Predicted:', encoding_to_words(Y_sentences[i], deu_words))
    print('   Target:', encoding_to_words(T[i], deu_words))
    Input: tom can t believe mary let herself get caught
Predicted: tom kann nicht glauben dass maria sich erwischen ließ
   Target: tom kann nicht glauben dass maria sich erwischen ließ

    Input: even though tom just had his fortieth birthday i
Predicted: tom hatte zwar gerade seinen vierzigsten geburtstag ich glaube
   Target: tom hatte zwar gerade seinen vierzigsten geburtstag ich glaube

    Input: she is 35 years old and in the prime
Predicted: sie ist 35 und in ihren besten jahren
   Target: sie ist 35 und in ihren besten jahren

    Input: guess who i am
Predicted: rate wer ich bin
   Target: rate wer ich bin

    Input: even tom doesn t know mary
Predicted: selbst tom kennt maria nicht
   Target: selbst tom kennt maria nicht

    Input: tom is drinking a beer
Predicted: tom trinkt ein bier
   Target: tom trinkt ein bier

    Input: i work as many hours as you do
Predicted: ich arbeite gleich viele stunden wie du
   Target: ich arbeite gleich viele stunden wie du

    Input: the students couldn t answer
Predicted: antworten konnten die studenten nicht
   Target: antworten konnten die studenten nicht

    Input: i want a new kitchen
Predicted: ich will eine neue küche
   Target: ich will eine neue küche

    Input: i ll dream about you
Predicted: ich werde von dir träumen
   Target: ich werde von dir träumen
Let's try some sentences that were not part of the training data.
n = 10
Xtest = torch.tensor(eng_sentences[-n:])
Ttest = torch.tensor(deu_sentences[-n:])
n = 10
with torch.no_grad():
    Ytest = model(Xtest[:n, :], Ttest[:n, :])
    Ytest_sentences = Ytest.argmax(-1)
for i in range(n):
    print()
    print('    Input:', encoding_to_words(Xtest[i], eng_words))
    print('Predicted:', encoding_to_words(Ytest_sentences[i], deu_words))
    print('   Target:', encoding_to_words(Ttest[i], deu_words))
    Input: she accused him of stealing her money
Predicted: sie scheine ihn ihr geld möglicherweise zu haben
   Target: sie beschuldigte ihn ihr geld gestohlen zu haben

    Input: tom has a foreign car
Predicted: tom hat ein nur auto
   Target: tom hat ein ausländisches auto

    Input: we have more in common than i thought
Predicted: wir haben mehr wer als ich dachte
   Target: wir haben mehr gemeinsamkeiten als ich dachte

    Input: please answer this question for me
Predicted: bitte sie diese frage für mich
   Target: bitte beantworten sie diese frage für mich

    Input: my sister takes a shower every morning
Predicted: meine schwester betrunken jeden morgen
   Target: meine schwester duscht jeden morgen

    Input: tom is healthy
Predicted: tom ist frau
   Target: tom ist gesund

    Input: i don t care about your past
Predicted: ihre kein mich nicht
   Target: ihre vergangenheit interessiert mich nicht

    Input: tom tried to discourage mary from going out with
Predicted: tom finde maria davon wecke mit wie mathematische
   Target: tom versuchte maria davon abzubringen mit johannes auszugehen

    Input: we both know it s too late
Predicted: wir wissen beide dass es zu hemd ist
   Target: wir wissen beide dass es zu spät ist

    Input: where s your money
Predicted: wo ist ihr geld
   Target: wo ist ihr geld
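In the two evaluation cells above the decoder is handed the correct German sentence as its input, so it only has to reproduce words it can already see (teacher forcing). To translate a sentence we have no target for, the decoder must be run autoregressively, starting from <sos> and feeding back its own predictions. A rough sketch under the same tensor layout used above (greedy_translate is a name introduced here; output quality will be limited given this small model):
def greedy_translate(model, x_sentence, deu_words, max_len=MAX_SENTENCE_LENGTH):
    # Start the decoder input with <sos> and repeatedly append the most likely next word.
    model.eval()
    decoded = [deu_words.index('<sos>')]
    with torch.no_grad():
        for _ in range(max_len - 1):
            T_partial = torch.tensor([decoded])                  # shape (1, current length)
            Y = model(x_sentence.unsqueeze(0), T_partial)        # shape (1, current length, vocab)
            next_word = Y[0, -1, :].argmax().item()
            decoded.append(next_word)
            if deu_words[next_word] == '<eos>':
                break
    return decoded

print(encoding_to_words(greedy_translate(model, Xtest[0], deu_words), deu_words))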